In maximum-entropy spectral analysis (MESA), one maximizes the integral of $\log S(f)$,
where $S(f)$ is a power spectrum. The resulting spectral estimate, which is equivalent to that
obtained by linear prediction and other methods, is popular in speech-processing applications. An
alternative expression, $-S(f)\log S(f)$, is used in optical processing and elsewhere. This
report considers whether the alternative expression leads to spectral estimates useful in speech
processing. We investigate the question both theoretically and empirically. The theoretical
investigation is based on generalizations of the two estimates; the generalizations take into account
prior estimates of the unknown power spectrum. It is shown that both estimates result from
applying a generalized version of the principle of maximum entropy, but they differ concerning
the quantities that are treated as random variables. The empirical investigation is based on
speech synthesized using the different spectral estimates. Although both estimates lead to
intelligible speech, speech based on the MESA estimate is qualitatively superior.

WHICH IS THE BETTER ENTROPY EXPRESSION
FOR SPEECH PROCESSING:

-S LOG S OR LOG S?


INTRODUCTION 

Because the power spectrum $S(f)$ of a band-limited stationary process is related to its autocorrelation
function $R(t)$ by a Fourier transform, and because it is relatively easy to measure $R(t)$, many
spectral analysis techniques start with values of $R(t)$. If $R(t)$ is known at a set of points or regions,
one class of spectral analysis techniques proceeds by extrapolating $R(t)$ so as to take on reasonable
values in the unknown regions. The extrapolated autocorrelation function is equivalent to a
power-spectrum estimate by a Fourier transform.

Perhaps the best known extrapolation technique is Burg's maximum-entropy spectral analysis
(MESA) [1,2], in which the power spectrum $S(f)$ is estimated by maximizing

$$\int_0^W \log S(f)\, df \qquad (1)$$

subject to the constraints

$$R(t_r) = \int_{-W}^{+W} S(f) \exp(2\pi i t_r f)\, df, \qquad (2)$$

where $W$ is the bandwidth and where $R(t_r)$, $r = 1, 2, \ldots, M$, are known values of the autocorrelation
function. The MESA estimate of $S(f)$ has the well-known all-pole, autoregressive, or linear prediction
form, which can also be derived by various equivalent formulations [3-6]. It has become one of the
most widely used spectral analysis techniques in geophysical data processing [7-9] and speech processing
[4,10].

"Maximum-entropy spectral analysis" is also used in image processing. In that field, however, the
phrase refers not only to successful estimates produced by maximizing (1) [11-13], but also to
estimates produced by maximizing [14-16]

$$-\int_0^W S(f) \log S(f)\, df. \qquad (3)$$

Spectral  estimates  based  on  (3)  have  also  been  studied  for  ARMA  and  meteorological  time  series 
[17,18].  Although  there  is  controversy  in  the  image-processing  literature  about  whether  (1)  or  (3) 
yields  the  best  estimates  [16,19],  the  success  of  (3)  in  image  processing  raises  the  question  of  whether 
(3)  might  also  be  useful  in  speech  processing.  We  consider  the  question  in  this  report  and  attempt  to 
answer  it.  As  part  of  our  investigation,  we  also  derive  a  generalization  of  the  estimate  produced  by 
maximizing  (3),  one  that  takes  into  account  a  prior  estimate  of  the  unknown  power  spectrum. 

Our  report  is  organized  as  follows:  In  the  next  section  we  review  derivations  of  the  forms  (1)  and 
(3),  and  we  discuss  theoretical  arguments  for  each  of  them.  We  then  turn  to  an  empirical  comparison. 
Our  approach  is  discussed  in  the  third  section,  and  the  results  are  summarized  in  the  fourth  section.  A 
brief  general  discussion  then  follows  in  the  concluding  section. 

BACKGROUND

In this section we derive the different spectral estimators that result from maximizing (1) and (3).
We find that they both result from applying a generalized form of the principle of maximum entropy
[20-22], but they differ concerning the quantities that are treated as random variables. In the case of
(1), the underlying random variables are the coefficients of a Fourier-series model, and the spectral
powers $S(f)$ are expected values. In the case of (3), the spectral power $S(f)$, suitably normalized, is
treated as a probability density, and the underlying random variable is the frequency.

The log S Form

In deriving MESA, Burg's approach was to extrapolate $R(t)$ in a manner that maximizes the
entropy of the underlying stochastic process [1,2]. This is an application of the principle of maximum
entropy [20-22]. An intuitive justification for such an extrapolation of $R(t)$ is that it agrees with what
is known, as expressed by the constraints (2), while being "maximally noncommittal" about what is
not known [20]. In particular, (1) is the entropy gain in a stochastic process that is passed through a
linear filter with characteristic function $Y(f)$, where $S(f) = |Y(f)|^2$, as described in Refs. 9 (pp. 412-414),
23 (pp. 93-95), and 24 (p. 243). If the input process is white noise, then the output process has
spectral power density $S(f)$. This suggests that the process entropy can be maximized by maximizing
the entropy gain of the filter that produces the process. Thus (1) is maximized subject to the
constraints (2). The result is

$$S(f) = \frac{\sigma^2}{\left|\sum_{k=0}^{M} a_k z^k\right|^2}, \qquad (4)$$

where $z = \exp(-2\pi i f \Delta t)$. This is the familiar MESA [2] or linear-prediction-coding (LPC) [4]
estimate. The $a_k$ are the inverse-filter sample coefficients, and $\sigma^2$ is the gain. Such derivations of (4) have
several mathematical and logical drawbacks [25]. For example, entropy is mathematically ill-behaved
for continuous densities [26, pp. 31-32]. A derivation of MESA without these drawbacks arises as a
special case of minimum cross-entropy spectral analysis (MCESA) [25] and also helps to expose the
difference underlying the choice of maximizing (1) or (3).
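
The all-pole form (4) can be illustrated with the Levinson recursion that the report later uses as a check. The following is a minimal sketch, not the authors' code; the function names `levinson_durbin` and `all_pole_spectrum` are hypothetical, the convention $a_0 = 1$ is an assumption, and the overall gain normalization may differ from the report's conventions by a constant factor.

```python
import math

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for r[0..order].

    Returns (a, sigma2, refl): inverse-filter coefficients with a[0] == 1
    (assumed convention), prediction-error power, reflection coefficients.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    refl = []
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        refl.append(k)
        new_a = a[:]
        for i in range(1, m):
            new_a[i] = a[i] + k * a[m - i]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err, refl

def all_pole_spectrum(a, sigma2, f, dt):
    """Evaluate sigma2 / |sum_k a_k z^k|^2 with z = exp(-2 pi i f dt)."""
    re = sum(ak * math.cos(2.0 * math.pi * f * dt * k) for k, ak in enumerate(a))
    im = -sum(ak * math.sin(2.0 * math.pi * f * dt * k) for k, ak in enumerate(a))
    return sigma2 / (re * re + im * im)
```

For example, the autocorrelations `[4/3, 2/3, 1/3]` of a first-order autoregressive process give back a single nonzero predictor coefficient, so a higher analysis order adds nothing in that case.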


Like MESA, MCESA is an information-theoretic extrapolation of $R(t)$, but it differs from MESA
in that it accounts for a prior estimate of $S(f)$ (or $R(t)$). MCESA is based on the principle of
minimum cross-entropy (discrimination information, directed divergence, Kullback-Leibler number,
relative entropy) [27-30]. Cross-entropy minimization estimates an unknown probability density $q^\dagger(x)$
from a prior estimate $p(x)$ and known expected values

$$\int q^\dagger(x)\, g_r(x)\, dx = \bar{g}_r, \qquad (5)$$

where $r = 0, \ldots, M$. The estimate is obtained by minimizing the cross-entropy

$$H(q,p) = \int q(x) \log \frac{q(x)}{p(x)}\, dx \qquad (6)$$

subject to the constraints (5) and

$$\int q(x)\, dx = 1. \qquad (7)$$

When $p(x)$ is interpreted as a prior estimate, cross-entropy minimization can be viewed as a
generalization of entropy maximization [28]: cross-entropy minimization reduces to entropy maximization when
$p(x)$ is uniform. When $p(x)$ is interpreted as an "invariant measure" as in [30], the two principles can
be viewed as equivalent. In either case, the resulting estimate of $q^\dagger(x)$ has the form [27,29,31]

$$q(x) = p(x) \exp\left[-\lambda - \sum_{r=0}^{M} \beta_r g_r(x)\right], \qquad (8)$$

where the $\beta_r$ and $\lambda$ are Lagrangian multipliers determined by (5) and (7). We refer to $p(x)$ as a prior.
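
A small discrete example may make the exponential form (8) concrete. The sketch below assumes a finite set of outcomes and a single expected-value constraint, which is simpler than the continuous, multi-constraint case in the text; the function name `min_cross_entropy` and the bisection bounds are hypothetical choices. Dividing by the sum of the weights plays the role of the multiplier $\lambda$.

```python
import math

def min_cross_entropy(prior, g, target, lo=-50.0, hi=50.0, iters=200):
    """Return (q, beta) with q_i proportional to prior_i * exp(-beta * g_i)
    and sum_i q_i * g_i equal to target (single-constraint case)."""
    def posterior(beta):
        w = [p * math.exp(-beta * gi) for p, gi in zip(prior, g)]
        z = sum(w)  # normalization, i.e. exp(lambda)
        return [wi / z for wi in w]

    def mean(beta):
        return sum(qi * gi for qi, gi in zip(posterior(beta), g))

    # The constrained mean decreases monotonically in beta, so bisection suffices.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) > target:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    return posterior(beta), beta
```

With a uniform prior the result is the ordinary maximum-entropy distribution, which illustrates the reduction to entropy maximization mentioned above.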


In deriving MCESA we consider time-domain signals of the form

$$s(t) = \sum_{k=1}^{N} \left[ a_k \cos(2\pi f_k t) + b_k \sin(2\pi f_k t) \right], \qquad (9)$$

where the $a_k$ and $b_k$ are random variables and where the $f_k$ are nonzero frequencies. Since any
stationary random process $g(t)$ can be obtained as the limit of a sequence of processes with discrete spectra
[32, p. 36], (9) is quite general. With suitable choices for the frequencies and amplitudes, the mean
square error $E(|g(t) - s(t)|^2)$ can be made arbitrarily small. Since the power at frequency $f_k$ is
$x_k = \tfrac{1}{2}(a_k^2 + b_k^2)$, we describe the random process in terms of a joint probability density $q(x)$, where
$x = (x_1, x_2, \ldots, x_N)$.


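Let $S_k^\dagger$ be the spectral power at frequency $f_k$ of some unknown process $q^\dagger(x)$:

$$S_k^\dagger = \int x_k\, q^\dagger(x)\, dx. \qquad (10)$$

Furthermore let $P_k$ be a prior estimate of $S_k^\dagger$. As a form for the prior estimate of the probability
density $q^\dagger$, we assume

$$p(x) = \prod_{k=1}^{N} \frac{1}{P_k} \exp\left[-\frac{x_k}{P_k}\right]. \qquad (11)$$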


This assumption is consistent with the prior spectral-power estimates, since $\int x_k\, p(x)\, dx = P_k$, and it
is equivalent to a Gaussian prior assumption for the amplitudes $a_k$ and $b_k$ in (9) [25]. Suppose that
one obtains new information about $q^\dagger$ in the form of $M + 1$ values of the autocorrelation function
$R(t_r)$:

$$R_r = \sum_{k=-N}^{+N} S_k^\dagger \exp(2\pi i t_r f_k) = \sum_{k=1}^{N} 2 S_k^\dagger \cos(2\pi t_r f_k), \qquad (12)$$

where $r = 0, \ldots, M$ and $t_0 = 0$. Using (10), we write this as

$$R_r = \int \left[ \sum_{k=1}^{N} 2 x_k \cos(2\pi t_r f_k) \right] q^\dagger(x)\, dx, \qquad (13)$$


which has the form of expected-value constraints (5). Given the prior (11) and the constraints (13),
one can compute a minimum cross-entropy posterior estimate $q(x)$ of the form (8). The result can be
written [25] as

$$q(x) = \prod_{k=1}^{N} \left(\frac{1}{P_k} + u_k\right) \exp\left[-\left(\frac{1}{P_k} + u_k\right) x_k\right], \qquad (14)$$

where

$$u_k = \sum_{r=0}^{M} 2\beta_r \cos(2\pi t_r f_k).$$

The $\beta_r$ are Lagrangian multipliers determined by the constraints (13). The posterior estimate of the
power spectrum is just $S_k = \int x_k\, q(x)\, dx$, which becomes

$$S_k = \frac{1}{\dfrac{1}{P_k} + \sum_{r=0}^{M} 2\beta_r \cos(2\pi t_r f_k)}, \qquad (15)$$
where the $\beta_r$ are chosen so that the $S_k$ satisfy the autocorrelation constraints

$$R_r = \sum_{k=1}^{N} 2 S_k \cos(2\pi t_r f_k). \qquad (16)$$

If one assumes a flat prior estimate of the power spectrum, $P_k = P$, and equal spacing of the
autocorrelation lags, $t_r = r\Delta t$, (15) can be written in the form (4) [25].


From the foregoing we see that the form (4) results from treating the Fourier power variables
$x_k = \tfrac{1}{2}(a_k^2 + b_k^2)$ as random variables and applying cross-entropy minimization to the probability
density $q(x)$. The same results are obtained if one treats the Fourier amplitudes $a_k$ and $b_k$ themselves as
the random variables [25,33]. To see the relationship between the maximization form (1), the spectral
estimate (4), and the underlying density $q(x)$ more directly, note that the posterior probability density
(14) can be expressed in terms of the posterior spectral-power estimates (15):

$$q(x) = \prod_{k=1}^{N} \frac{1}{S_k} \exp\left[-\frac{x_k}{S_k}\right]. \qquad (17)$$

Computing the normalized differential entropy of the posterior power estimates (15) yields

$$-\frac{1}{N} \int q(x) \log q(x)\, dx = 1 + \frac{1}{N} \sum_{k=1}^{N} \log S_k. \qquad (18)$$

Except for the constant, which has no effect on maximization, the right-hand side of (18) is the
discrete form of (1). Maximizing (18) subject to the constraints (16) leads again to (4).
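
The step from (17) to (18) can be spelled out. The following is a sketch of the calculation, assuming only the product-of-exponentials form of (17) and the identity $\int x_k\, q(x)\, dx = S_k$ stated before (15):

```latex
\begin{aligned}
-\int q(x)\log q(x)\,dx
  &= \int q(x)\sum_{k=1}^{N}\left[\log S_k + \frac{x_k}{S_k}\right]dx \\
  &= \sum_{k=1}^{N}\left[\log S_k + \frac{1}{S_k}\int x_k\,q(x)\,dx\right] \\
  &= N + \sum_{k=1}^{N}\log S_k .
\end{aligned}
```

Dividing by $N$ gives a constant plus the average of $\log S_k$, which is why maximizing the entropy of $q(x)$ and maximizing $\sum_k \log S_k$ select the same spectrum.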


The  -S  log  S  Form 

Results from the preceding subsection show that the MESA or LPC spectral estimate (4) is the
result of applying maximum entropy or minimum cross-entropy with the coefficients of the underlying
Fourier series (9) treated as random variables. In arguing for the maximization of $-\sum_k S_k \log S_k$
rather than $\sum_k \log S_k$, Skilling [16] points out that the goal is to estimate the power spectrum itself,
not the Fourier amplitudes in an underlying model like (9), so that a more direct and better estimate
should result from treating the unknown power-spectrum variables $S_k^\dagger$ as probabilities. Mathematically
this is reasonable, provided that the power spectrum is normalized so that $\sum_k S_k^\dagger = 1$. The difference in
the two approaches is illustrated well by (10). In the first approach, one assumes that the $S_k^\dagger$ are
expectations of an underlying probability density $q^\dagger(x)$, and one expresses the known autocorrelations as
expectations of $q^\dagger(x)$ as in (13); it follows from the preceding subsection that one should maximize
$\sum_k \log S_k$. In the second approach, one assumes that the $S_k^\dagger$ are probabilities, and one expresses the
known autocorrelations as expectations of the probability distribution $S_k^\dagger$, $k = 1, \ldots, N$, as in (12)
(we defer for the moment details concerning correct normalization); it follows that maximum entropy
implies the maximization of $-\sum_k S_k \log S_k$.


In deriving the power spectrum estimate that results from maximizing $-\sum_k S_k \log S_k$, we proceed
with the general case involving a prior estimate and cross-entropy minimization as in the previous
subsection. Since we assumed a known autocorrelation for lag $t_0 = 0$, $\sum_k S_k^\dagger = \tfrac{1}{2} R_0$ is known. Let
$q^\dagger = (q_1^\dagger, q_2^\dagger, \ldots, q_N^\dagger)$ and $p = (p_1, p_2, \ldots, p_N)$ be probability distributions defined by normalizing
the power spectra $S_k^\dagger$ and $P_k$:

$$q_k^\dagger = \frac{S_k^\dagger}{\tfrac{1}{2} R_0}, \qquad p_k = \frac{P_k}{T},$$

where $P_k$ is a prior estimate of $S_k^\dagger$, and where

$$T = \sum_{k=1}^{N} P_k.$$

We rewrite the autocorrelation constraints (12) as expectations of $q^\dagger$:

$$\frac{2R_r}{R_0} = \sum_{k=1}^{N} 2 \cos(2\pi t_r f_k)\, q_k^\dagger. \qquad (19)$$

Then we obtain a posterior estimate $q$ of $q^\dagger$ by minimizing the cross-entropy

$$H(q,p) = \sum_{k=1}^{N} q_k \log \frac{q_k}{p_k}$$

subject to the constraints (19). Note that the constraint for $r = 0$ reduces to the normalization
constraint $\sum_k q_k = 1$. The result is

$$q_k = p_k \exp\left[-\sum_{r=0}^{M} 2\beta_r \cos(2\pi t_r f_k)\right], \qquad (20)$$

where the $\beta_r$ are chosen to satisfy the constraints. We define the posterior power-spectrum estimate as
$S_k = \tfrac{1}{2} R_0 q_k$, which satisfies (12).


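We have

$$\sum_{k=1}^{N} S_k \log \frac{S_k}{P_k} = \tfrac{1}{2} R_0\, H(q,p) + \tfrac{1}{2} R_0 \log \frac{R_0}{2T},$$

where $\tfrac{1}{2} R_0$ and $\tfrac{1}{2} R_0 \log(R_0/2T)$ are constant and $\tfrac{1}{2} R_0 > 0$. It follows that minimizing $H(q,p)$ is
equivalent to minimizing

$$\sum_{k=1}^{N} S_k \log \frac{S_k}{P_k}. \qquad (21)$$

Minimizing (21) subject to the constraints (12) yields

$$S_k = P_k \exp\left[-\sum_{r=0}^{M} 2\phi_r \cos(2\pi t_r f_k)\right], \qquad (22)$$

where the $\phi_r$ are chosen to satisfy the constraints. The $\phi_r$ in (22) are equal to the $\beta_r$ in (20) except for
$\phi_0$, which satisfies $\phi_0 = \beta_0 - \tfrac{1}{2} \log(R_0/2T)$.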


For a flat prior estimate $P_k = P$, minimizing (21) is equivalent to maximizing

$$-\sum_{k=1}^{N} S_k \log S_k, \qquad (23)$$

which is just the discrete form of (3). Spectral estimates based on the minimization of (21) have been
reported recently in Ref. 34. Also, a first-order approximation of the estimate (22) appears to be
equivalent to the PDFT estimator introduced in Refs. 35 and 36.
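
To show how the multipliers in (22) might be found numerically, here is a minimal Newton-Raphson sketch. It is not the program used for the report; it assumes equally spaced lags $t_r = r\Delta t$, and the names `mensa` and `solve_linear` are hypothetical. The iteration minimizes the convex function $\sum_k S_k + \sum_r \phi_r R_r$, whose gradient vanishes exactly when the autocorrelation constraints (12) hold.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(m[i][c]))
        m[c], m[piv] = m[piv], m[c]
        for i in range(c + 1, n):
            fac = m[i][c] / m[c][c]
            for j in range(c, n + 1):
                m[i][j] -= fac * m[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def mensa(prior, r, freqs, dt, iters=100, tol=1e-10):
    """Return (S, phi) with S_k = P_k exp(-sum_r 2 phi_r cos(2 pi r dt f_k))
    such that sum_k 2 S_k cos(2 pi r dt f_k) matches r[0..M]."""
    n_c, n_f = len(r), len(freqs)
    c = [[math.cos(2.0 * math.pi * j * dt * f) for f in freqs] for j in range(n_c)]
    phi = [0.0] * n_c

    def spectrum(ph):
        # The exponent is clipped only to avoid overflow during line search.
        return [p * math.exp(min(700.0, -sum(2.0 * ph[j] * c[j][k] for j in range(n_c))))
                for k, p in enumerate(prior)]

    def dual(ph, s):
        return sum(s) + sum(pj * rj for pj, rj in zip(ph, r))

    s = spectrum(phi)
    for _ in range(iters):
        grad = [r[j] - sum(2.0 * c[j][k] * s[k] for k in range(n_f)) for j in range(n_c)]
        if max(abs(g) for g in grad) < tol:
            break
        hess = [[sum(4.0 * c[i][k] * c[j][k] * s[k] for k in range(n_f))
                 for j in range(n_c)] for i in range(n_c)]
        step = solve_linear(hess, [-g for g in grad])
        t, f0 = 1.0, dual(phi, s)
        while True:  # halve the step until the objective does not increase
            trial = [p + t * d for p, d in zip(phi, step)]
            s_trial = spectrum(trial)
            if dual(trial, s_trial) <= f0 or t < 1e-10:
                break
            t *= 0.5
        phi, s = trial, s_trial
    return s, phi
```

Because the $r = 0$ constraint fixes $\sum_k S_k$, no separate normalization multiplier is needed in this formulation.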


Summary 


Here we summarize the final results for the two spectral estimates. Both estimates proceed from a
prior estimate $P_k$ and known autocorrelations $R_r$. When the coefficients in an underlying Fourier-series
model are treated as random variables and the $S_k$ are treated as expectations, cross-entropy
minimization leads to the estimate

$$S_k = \frac{1}{\dfrac{1}{P_k} + \sum_{r=0}^{M} 2\beta_r \cos(2\pi t_r f_k)}. \qquad (24)$$

For the case of a flat prior estimate, (24) follows from maximizing $\sum_k \log S_k$. When the $S_k$ are
treated as probabilities rather than expectations, cross-entropy minimization leads to the estimate

$$S_k = P_k \exp\left[-\sum_{r=0}^{M} 2\phi_r \cos(2\pi t_r f_k)\right]. \qquad (25)$$

For the case of a flat prior estimate, (25) follows from maximizing $-\sum_k S_k \log S_k$. Because the result
in this case arises from performing maximum entropy on a probability distribution defined by
normalizing a power spectrum, we refer to it as maximum-entropy normalized spectral analysis (MENSA).**


The Lagrangian multipliers $\beta_r$ in (24) and $\phi_r$ in (25) are chosen in both cases so that the
estimates agree with the known autocorrelations

$$R_r = \sum_{k=1}^{N} 2 S_k \cos(2\pi t_r f_k), \qquad (26)$$

where $r = 0, 1, \ldots, M$. Note that, given one of the spectral estimates $S_1$ through $S_N$, substitution of
an arbitrary lag $t$ for $t_r$ in (26) defines the corresponding extrapolation of the known autocorrelations.
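
The extrapolation described here amounts to evaluating the cosine sum in (26) at lags beyond the constrained ones. A minimal sketch, with the hypothetical function name `extrapolate_autocorrelation`:

```python
import math

def extrapolate_autocorrelation(spectrum, freqs, lags):
    """Evaluate sum_k 2 S_k cos(2 pi t f_k) at each lag t, following (26)."""
    return [sum(2.0 * s * math.cos(2.0 * math.pi * t * f)
                for s, f in zip(spectrum, freqs))
            for t in lags]
```

Passing the same lag list with a MESA spectrum and with a MENSA spectrum yields two extrapolations that agree on the constrained lags and may differ elsewhere.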


Which of the two estimates (24) and (25) is better? In our opinion, if one has a good physical
model for some variable of interest, and if the model can be incorporated into the derivation of an
estimate for that variable, it makes sense to do so. Because such estimates can exploit more information
than estimates derived without an underlying model, estimates based on underlying models should be
better. Furthermore, although we recognize that normalizing the $S_k$ and treating them as probabilities
is mathematically sound, we do not see any reasonable physical interpretation. What events have
probabilities proportional to $S_k$? This suggests that (24) is better. Also, since (24) yields all-pole models
in the important case of flat priors, since all-pole spectra result from passing a broadband signal through
a multilayered transmission medium, and since the human vocal tract is a multilayered transmission
medium, it follows that (24) should be appropriate for speech processing. On the other hand,
arguments for (25) also have merit, and it is clear that the best method of answering the question is
empirical. This we attempt to do in the remainder of this report.


EXPERIMENTAL  APPROACH 


In this section we present basic definitions, discuss our experimental approach, and discuss various
computational issues. Our general approach is to process various speech signals in order to compare
measured power spectra and autocorrelations with MESA and MENSA estimates. We also synthesize
speech using both MESA and MENSA power-spectrum estimates and qualitatively compare the results.

Definitions  and  Notation 

Let $y = \{y_1, y_2, \ldots, y_L\}$ comprise $L$ time-domain samples, equispaced at intervals of $\Delta t$, from
one "frame" of speech data. From $y$ we compute estimated autocorrelations $R = \{R_0, \ldots, R_{L-1}\}$
by means of

$$R_r = \frac{1}{L} \sum_{j=1}^{L-r} y_j\, y_{j+r}. \qquad (27)$$

This is a biased estimate, but it guarantees positive-definiteness. Let $Q = \{Q_1, Q_2, \ldots, Q_N\}$ be the
power spectrum defined by the discrete Fourier transform of the measured autocorrelations. Defining
$R_{-r} = R_r$, we have

$$Q_k = \sum_{r=-L+1}^{L-1} R_r \exp(-2\pi i t_r f_k) = R_0 + \sum_{r=1}^{L-1} 2 R_r \cos(2\pi t_r f_k). \qquad (28)$$

**This somewhat contrived acronym has the additional virtue of being the Latin word for table, which is the source of the
Spanish word for table (mesa).

As the $N$ discrete frequencies we take

$$f_k = \frac{k - \tfrac{1}{2}}{2N\Delta t}, \qquad k = 1, \ldots, N.$$

That is, we divide the interval from zero to the Nyquist frequency into $N$ subintervals and define $f_k$ as
the midpoint of each subinterval.
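
The definitions in this subsection translate directly into code. The sketch below assumes the biased estimate (27) carries a $1/L$ factor, lags $t_r = r\Delta t$, and a Nyquist frequency of $1/(2\Delta t)$; the function names are hypothetical.

```python
import math

def biased_autocorrelation(y):
    """R_r = (1/L) sum_j y_j y_{j+r} for r = 0..L-1, as in (27)."""
    L = len(y)
    return [sum(y[j] * y[j + r] for j in range(L - r)) / L for r in range(L)]

def midpoint_frequencies(n, dt):
    """Midpoints of n equal subintervals between zero and the Nyquist frequency."""
    nyquist = 1.0 / (2.0 * dt)
    return [(k - 0.5) * nyquist / n for k in range(1, n + 1)]

def measured_spectrum(R, freqs, dt):
    """Q_k = R_0 + sum_{r>=1} 2 R_r cos(2 pi r dt f_k), as in (28)."""
    return [R[0] + sum(2.0 * R[r] * math.cos(2.0 * math.pi * r * dt * f)
                       for r in range(1, len(R)))
            for f in freqs]
```

With the biased estimate, the resulting $Q_k$ coincide with a periodogram of the frame and are therefore nonnegative up to rounding.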

Let $S = \{S_1, S_2, \ldots, S_N\}$ be the power-spectrum estimate obtained from (24) using a flat prior
estimate and the first $M + 1$ autocorrelations $R_r$ from (27). $S$ is the standard MESA or LPC estimate
of the power spectrum; its usual, continuous-frequency form is given by (4). Let
$S^* = \{S_1^*, S_2^*, \ldots, S_N^*\}$ be the MENSA power-spectrum estimate obtained from (25) using the same
flat prior estimate and the same $M + 1$ autocorrelations from (27). Finally, let $A$ and $A^*$ be the
extrapolated autocorrelations for all $L$ lags $t_r = r\Delta t$, $r = 0, \ldots, L - 1$, obtained from (26) using $S$ and $S^*$
respectively. Note that $A_r$ and $A_r^*$ match the actual autocorrelations (27) for $r = 0, \ldots, M$. For
$r > M$, however, $A_r$ and $A_r^*$ are in general different from each other and from $R_r$. For convenience we
summarize the notation as follows:

y = a vector of L time-domain samples from one speech frame,

R = the measured autocorrelations for L lags computed from y,

Q = the "actual" power spectrum defined by a Fourier transform of R,

S = a MESA or LPC estimate of the power spectrum from first M + 1 lags of R,

S* = a MENSA estimate of the power spectrum from first M + 1 lags of R,

A = a MESA or LPC autocorrelation extrapolation based on S,

A* = a MENSA autocorrelation extrapolation based on S*.

For the work reported here, we used $L = 180$ and $M = 8, 10, 25$. When we refer to more than one
speech frame, we add a subscript to the foregoing definitions.

What  and  How  to  Compare 

Much work in speech analysis and synthesis uses $S$ to model the power spectrum. We are
interested in testing the hypothesis that using $S^*$ would lead to better results. To obtain information
that could help to confirm or refute the hypothesis, we did three things: (a) For a variety of
representative speech frames we plotted $A$, $A^*$, and $R$ and we performed qualitative and quantitative
comparisons. (b) For the same frames we plotted $S$, $S^*$, and $Q$ and performed qualitative comparisons. (c) We
performed qualitative comparisons of speech synthesized two different ways: we used identical pitch and
voicing decisions and used either $S$ or $S^*$ for spectral shape.

What about quantitative comparisons? For some distortion measure $d$, one could compare
$d(Q,S)$ with $d(Q,S^*)$, but what is the right choice for $d$? Clearly, one distortion measure could yield
$d(Q,S) < d(Q,S^*)$, while another could yield the reverse inequality. One reasonable choice is the
Itakura-Saito distortion $d_{IS}$ [37], which is known to be useful in speech processing.

But in the notation of our background section the Itakura-Saito distortion $d_{IS}(S,P)$ is just the
asymptotic cross-entropy between $q(x)$ and $p(x)$; hence, derivations of MESA or LPC spectra by
cross-entropy minimization are equivalent to derivations by minimization of Itakura-Saito distortions
[38,25,10]. Not only does $S$ minimize $d_{IS}(S,P)$ subject to the constraints, but $S$ is the spectrum of
the form (15) that minimizes $d_{IS}(Q,S)$ [37,39]. Use of the Itakura-Saito distortion might therefore
involve an intrinsic bias in favor of MESA. That is not to imply that $d_{IS}(Q,S)$ is necessarily less than
$d_{IS}(Q,S^*)$. However, we wish to avoid relying entirely on a distortion measure that relates specially to
the mathematics of one or the other of the two estimates.


We therefore consider a distortion measure that bears a relation to MENSA analogous to that of
$d_{IS}$ to MESA. We define the "cross-entropy distortion" $d_{CE}(Q,S)$ to be the cross-entropy of the
probability distributions obtained by normalizing $Q$ and $S$:

$$d_{CE}(Q,S) = \sum_{k=1}^{N} \frac{Q_k}{\sum_{j=1}^{N} Q_j} \log \frac{Q_k}{S_k} - \log \frac{\sum_{j=1}^{N} Q_j}{\sum_{j=1}^{N} S_j}.$$

Then $S^*$ minimizes $d_{CE}(S^*,P)$ subject to constraints just as $S$ minimizes $d_{IS}(S,P)$ subject to
constraints. Moreover $S^*$ is one of the spectra of the form (22) that minimizes $d_{CE}(Q,S^*)$ [29]. We use
$d_{CE}$ as well as $d_{IS}$ for quantitative comparisons; that is, we compare $d_{CE}(Q,S)$ with $d_{CE}(Q,S^*)$.


We also use a third distortion measure, the gain-optimized Itakura-Saito distortion [39] defined by
$d_{GO}(Q,S) = \min_g d_{IS}(gQ,S)$, where $g$ ranges over positive constant scale factors. This is closely
related to $d_{IS}$ but, like $d_{CE}$, is insensitive to changes in the gains of the two spectra. It can be computed
as

$$d_{GO}(Q,S) = \log\left[\frac{1}{N} \sum_{k=1}^{N} \frac{Q_k}{S_k}\right] - \frac{1}{N} \sum_{k=1}^{N} \log \frac{Q_k}{S_k}.$$
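
The three distortion measures can be sketched as follows. The function names are hypothetical, and the Itakura-Saito distortion is written in its common discrete form averaged over the $N$ frequency points, which is an assumption about normalization rather than a formula taken from this report.

```python
import math

def itakura_saito(q, s):
    """Assumed discrete form: mean over k of Q_k/S_k - log(Q_k/S_k) - 1."""
    n = len(q)
    return sum(qk / sk - math.log(qk / sk) - 1.0 for qk, sk in zip(q, s)) / n

def gain_optimized_is(q, s):
    """Minimum over g > 0 of itakura_saito(g * q, s), in closed form."""
    n = len(q)
    mean_ratio = sum(qk / sk for qk, sk in zip(q, s)) / n
    mean_log = sum(math.log(qk / sk) for qk, sk in zip(q, s)) / n
    return math.log(mean_ratio) - mean_log

def cross_entropy_distortion(q, s):
    """Cross-entropy between the distributions obtained by normalizing q and s."""
    tq, ts = sum(q), sum(s)
    return sum((qk / tq) * math.log((qk / tq) / (sk / ts)) for qk, sk in zip(q, s))
```

Scaling either spectrum by a positive constant leaves `gain_optimized_is` and `cross_entropy_distortion` unchanged, whereas `itakura_saito` is sensitive to gain.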


Numerical  Issues  and  Procedures 

The MENSA estimate $S^*$ can be produced by an algorithm that determines minimum cross-entropy
probability distributions given arbitrary priors and arbitrary constraints. The mathematics
underlying a Newton-Raphson-based algorithm is discussed in Appendix A of Ref. 29, and an APL
program that implements this algorithm is described in detail in Ref. 40. For the work reported here, we
used a FORTRAN version of the APL program. The resulting $S^*$ may be thought of as a
discrete-frequency approximation to a continuous power spectrum, one that is defined by the maximization of
$-\int S(f) \log S(f)\, df$ rather than $-\sum_k S_k \log S_k$. The accuracy of the discrete-frequency approximation
will depend on the number of frequency points $N$. Although it would be better to use an algorithm that
produced a continuous representation of the MENSA estimate, we do not have such an algorithm.

As for $S$, a variety of methods are available. Standard MESA or LPC methods can produce the
inverse-filter coefficients used in (4) or any of the equivalent sets of parameters such as reflection
coefficients. The result is a continuous representation of the spectral estimate that can then be
evaluated at the frequencies $f_k$ in order to yield $S$. No doubt this is more accurate than a method that
computes a discrete-frequency approximation, but to use it might introduce a misleading source of
differences between $S$ and $S^*$. To avoid such a problem, we chose to compute $S$ in a manner
analogous to the computation of $S^*$. In particular we used a FORTRAN implementation of the
MCESA [25] algorithm described in Ref. 41. This algorithm uses the Newton-Raphson method to
compute (24) for arbitrary priors and autocorrelation constraints. For a flat prior, the result is just a
discrete-frequency approximation to a continuous MESA or LPC spectrum. As checks on the
discrete-frequency computations of $S^*$ and $S$, we obtained results for various values of the number of
frequency points $N$, and we compared the results for $S$ with continuous-frequency results obtained using
Levinson recursion.

In  considering  how  to  obtain  synthetic  speech  using  the  two  different  spectral  shapes,  we  decided 
to take advantage of commonly available, LPC-based programs. This approach, which is ideal for
MESA  spectra,  involves  exciting  an  all-pole  filter  with  either  white  noise,  for  unvoiced  sounds,  or  a 
periodic  pulse  train,  for  voiced  sounds.  For  MENSA  spectra,  which  do  not  have  the  all-pole  form,  we 
had  to  proceed  indirectly.  Our  procedure  was  as  follows:  First  we  analyzed  the  test  sentence  for  pitch 
and  voicing  using  a  modified  cepstral  technique  described  in  Ref.  42  and  implemented  in  Version  4.0  of 
the  Interactive  Laboratory  System  (ILS)  from  Signal  Technology,  Inc.  The  results  were  used  for  both 
syntheses.  For  the  synthesis  based  on  S*  we  used  a  29th-order  all-pole  approximation  to  the  power 
spectrum  S*  in  each  frame.  This  approximation  was  computed  by  taking  the  first  29  lags  of  the  auto¬ 
correlation  extrapolation  A*  and  using  Levinson  recursion  to  yield  a  set  of  reflection  coefficients.  As 
checks  we  plotted  the  resulting  approximate  power  spectrum  and  compared  it  with  A*.  For  the  syn¬ 
thesis  based  on  S  we  followed  the  same  procedure—we  ran  Levinson  recursion  on  the  first  29  lags  of 
A .  Had  we  been  dealing  with  exact,  continuous  spectra,  the  resulting  "approximate”  spectrum  would  be 
exactly  equal  to  S,  so  it  would  have  been  reasonable  to  bypass  this  step.  We  included  it,  however,  to 
keep  the  comparison  as  fair  as  possible.  As  a  check  we  also  synthesized  speech  using  spectral  shapes 
obtained  directly  from  Levinson  recursion  on  the  first  M  +  1  lags  of  the  measured  autocorrelations  R . 
Note  that  the  29th-order  all-pole  synthesis  spectra  are  29th-order  approximations  to  S  and  S*  and  not 
29th-order  approximations  to  Q . 
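
The excitation step of such an LPC-style synthesis can be sketched as follows. This is a simplified illustration, not the ILS procedure: the function name `synthesize_frame` is hypothetical, the coefficient convention `a[0] == 1` is assumed, and filter state is not carried from one frame to the next.

```python
import random

def synthesize_frame(a, gain, n_samples, pitch_period=None, seed=0):
    """Excite the all-pole filter gain / A(z) with a unit pulse train
    (voiced, if pitch_period is given) or white noise (unvoiced)."""
    rng = random.Random(seed)
    if pitch_period:
        excitation = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    else:
        excitation = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    out = []
    for n, e in enumerate(excitation):
        # y[n] = gain * e[n] - sum_{k>=1} a[k] * y[n-k]
        feedback = sum(a[k] * out[n - k] for k in range(1, len(a)) if n - k >= 0)
        out.append(gain * e - feedback)
    return out
```

The coefficients `a` would come from Levinson recursion on the first lags of the chosen autocorrelation extrapolation, so the same routine serves both spectral shapes.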

EXPERIMENTAL  RESULTS 

We obtained results for the sentence "The meeting begins at four p.m." The sentence was spoken by
a male, passed through an antialiasing filter, digitized at 8000 samples per second, and divided into 100
frames of 180 samples each. Using 256 discrete frequencies ($N = 256$), we computed $R_j$, $Q_j$, $S_j$, $A_j$,
$S_j^*$, and $A_j^*$, $j = 1, \ldots, 100$, as discussed in the preceding section. We also did computations for
some cases with $N = 64$ and $N = 128$. In general there were no essential differences between results
for $N = 64$ or 128 and $N = 256$. It is a frequent practice to preprocess speech samples before the
autocorrelations are estimated, so that the $y_j$ in (27) are the result of preemphasizing or windowing the speech
samples. We therefore repeated the computations using Hamming windowing alone, 90% preemphasis
alone,  and  both  together.  In  the  following  we  focus  attention  on  two  frames:  frame  56,  which  contains 
a  portion  of  the  phoneme  ///,  and  frame  39,  which  contains  a  portion  of  the  phoneme  ///.  For  con¬ 
venience  we  refer  to  these  frames  by  means  of  the  subscripts  /and  /  respectively.  Unless  windowing  or 
preemphasis  is  explicitly  mentioned,  the  reference  is  to  the  spectra  computed  without  preprocessing. 
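
The framing and preprocessing steps can be sketched as below. The function names are hypothetical, and reading "90% preemphasis" as the first-difference filter $y_n - 0.9\,y_{n-1}$ is an assumption; the Hamming window uses the usual 0.54 and 0.46 coefficients.

```python
import math

def split_frames(samples, frame_len=180):
    """Cut a signal into consecutive non-overlapping frames of frame_len samples."""
    n = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n)]

def preemphasize(frame, coeff=0.9):
    """Assumed form of 90% preemphasis: y[n] - coeff * y[n-1]."""
    return [frame[0]] + [frame[n] - coeff * frame[n - 1] for n in range(1, len(frame))]

def hamming_window(frame):
    """Multiply the frame by a Hamming window of the same length."""
    L = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * n / (L - 1)))
            for n, x in enumerate(frame)]
```

At 8000 samples per second a 180-sample frame spans 22.5 ms, so 100 frames cover 2.25 s of speech.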

Comparison  of  Autocorrelation  Extrapolations 

In  Fig.  1  we  plot  R y^,  Ay,  and  A y  for  —  256.  When  we  plotted  the  continuous  autocorrelation 
function  obtained  by  Levinson  recursion,  it  was  indistinguishable  from  A  y,  which  implies  that  the 
discrete  frequency  approximations  are  accurate.  Beyond  the  constraint  limit  of  lag  10,  the  extrapolations 
Ay  and  A y  differ  from  each  other  as  well  as  from  R.  One  would  be  hard  pressed  to  argue  that  either 
one  is  a  "better"  extrapolation.  The  same  conclusion  follows  from  Fig.  2,  in  which  we  plot  analogous 
results  from  the  frame  containing  a  portion  of  the  phoneme  ///. 

Comparison  of  Power  Spectra 

Turning to the power spectra, we plot the MESA and MENSA spectra for the two frames in Figs. 3 through 6 for M = 10. For one frame the two spectra are quite similar; for the other they are quite different. In particular, the MENSA spectrum in the latter case has deep nulls that are characteristic of the MENSA estimates for the entire test sentence. Indeed Fig. 7 shows the superimposed results of S* for all 100 frames (N = 256). The frequent occurrence of five lobes is obvious. No such structure occurs for S (Fig. 8). The lobe structure appears to be related to the number of constraints: there are five lobes in Fig. 7, which is half the analysis order (M = 10). We repeated the MENSA computation using M = 25 and M = 8. The resulting plots were similar to Fig. 7 except that about 12 and four lobes were apparent respectively. Neither preemphasis nor windowing was entirely effective in eliminating the deep minima from the MENSA spectra. The superposed plots continued to show a lobed structure, though more complex and less regular than the consistent five-lobe pattern of Fig. 7. The results of using both Hamming windowing and 90% preemphasis are shown in Fig. 9. The lobes at 400 Hz, 1200 Hz, and 3600 Hz are still apparent, but the pattern is blurred between 2000 Hz and 3800 Hz.

Fig. 7 - MENSA spectra, 100 frames overlaid

Fig. 8 - MESA spectra, 100 frames overlaid

Fig. 9 - MENSA spectra from windowed, preemphasized speech, 100 frames overlaid

In Fig. 10 we compare the "actual" power spectrum Q with the MESA and MENSA estimates S and S* for one of the two frames. Both estimates appear to be smoothed versions of Q. Figure 11 shows the analogous comparison for ///. Here there is more of a difference. Because of the deep minima of S*, it appears more reasonable to interpret S than S* as a smoothed version of Q.

Three distortion measures for the MESA and MENSA spectra S and S* as estimates of Q were computed for each frame: d_IS(Q,S) and d_IS(Q,S*), d_GO(Q,S) and d_GO(Q,S*), and d_CE(Q,S) and d_CE(Q,S*). The computations were done for three values of M. The results, averaged over all 100 frames, are shown in Table 1. In one case the mean distortion for MENSA is slightly less than that for MESA, the difference being in the third decimal place. In every other case the mean distortion for MESA is less. This is true even for the "cross-entropy" distortions d_CE, which might have been expected to favor MENSA.

Fig. 10 - MESA and MENSA estimates with the Fourier transform of the measured autocorrelations (///)

Fig. 11 - MESA and MENSA estimates with the Fourier transform of the measured autocorrelations (///)


Table  1  —  Distortion  Results 


Window 


None 

None 

None 

Hamming 

Hamming 

Hamming 


Hamming 

Hamming 

Hamming 


Itakura-Saito
Distortion d_IS

MESA 

MENSA 

1.6  X  10” 

3.7  X  10*’ 

0.204 

5.6  X  10** 

0.589 

6.1  X  10** 

0.502 

3.3  X  10** 

0.310 

2.5  X  10*’ 

0.434 

6.5  X  10*5 

0.379 

1.1  X  10*’ 

0.265 

1.1  xlO* 

0.553 

1.6  X  10” 

0.446 

8.7  X  10*’ 

0.290 

4.2  X  10*’ 

Gain-Optimized
Itakura-Saito
Distortion d_GO


MESA  MENSA 


9.970 
10.549 
0.204  4.352 

0.589  12.340 

0.502  11.025 

0.310  4.776 


15.265 

16.347 

13.203 


Cross-Entropy
Distortion d_CE


MESA  MENSA 


361 
307 

185  I  0.290 


The d_CE results certainly do not favor MESA as overwhelmingly as those from the other two distortion measures, especially d_IS. The enormous Itakura-Saito distortions of the MENSA spectra are the result of the deep minima of the MENSA estimates. The expression for d_IS(Q,S*) contains the term Q_k/S*_k, which becomes extremely large when the estimate S*_k is nearly zero. The other two distortion measures contain such a term only logarithmically. Thus d_IS penalizes underestimates more severely than do d_GO and d_CE.
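
The effect of a deep null on the three measures can be illustrated numerically. The sketch below assumes the standard discrete forms of the Itakura-Saito and gain-optimized Itakura-Saito distortions, and one common generalized form of the cross-entropy; the definitions given earlier in this report may differ in normalization, so the numbers are illustrative only. The function names and the four-point spectra are hypothetical.

```python
import math

def d_is(q, s):
    """Itakura-Saito distortion: mean of Q/S - log(Q/S) - 1 (assumed form)."""
    ratios = [qk / sk for qk, sk in zip(q, s)]
    return sum(r - math.log(r) - 1.0 for r in ratios) / len(ratios)

def d_go(q, s):
    """Gain-optimized Itakura-Saito distortion: d_is minimized over a gain
    applied to S, which equals log(mean(Q/S)) - mean(log(Q/S))."""
    ratios = [qk / sk for qk, sk in zip(q, s)]
    n = len(ratios)
    return math.log(sum(ratios) / n) - sum(math.log(r) for r in ratios) / n

def d_ce(q, s):
    """Cross-entropy of Q relative to S in the generalized form
    sum of Q log(Q/S) - Q + S (an assumption, not the report's definition)."""
    return sum(qk * math.log(qk / sk) - qk + sk for qk, sk in zip(q, s))

# A flat "actual" spectrum and an estimate with one deep null.
q = [1.0, 1.0, 1.0, 1.0]
s_null = [1.0, 1.0, 1e-6, 1.0]

# The ratio Q/S enters d_is linearly but enters d_go and d_ce only
# through a logarithm, so the null dominates d_is alone.
print(d_is(q, s_null), d_go(q, s_null), d_ce(q, s_null))
```

In this toy case the single near-zero value drives d_is to several orders of magnitude above the other two measures, which is the pattern seen for the MENSA columns of Table 1.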
Two columns of the table are identical; it appears that d_IS(Q,S) = d_GO(Q,S). This is not a coincidence but is a property of d_IS and d_GO. The equality can be shown to hold provided that S is a MESA spectrum and that Q is a spectrum that satisfies the same autocorrelation constraints that determine S. A proof can be based on the "correlation matching" property [39,29] of MESA spectra.
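
One route to this equality is sketched below. It assumes the standard discrete forms of the two measures, with k indexing the N discrete frequencies and c_{rk} denoting the constraint functions (for example, cosines at the constrained lags); the symbols \beta_r and c_{rk} are introduced here for illustration, and the cited proof may proceed differently.

```latex
\begin{align*}
d_{IS}(Q,S) &= \frac{1}{N}\sum_{k}\left[\frac{Q_k}{S_k}-\log\frac{Q_k}{S_k}-1\right],\\
d_{GO}(Q,S) &= \min_{\alpha>0} d_{IS}(Q,\alpha S)
  = \log\!\left(\frac{1}{N}\sum_{k}\frac{Q_k}{S_k}\right)
    -\frac{1}{N}\sum_{k}\log\frac{Q_k}{S_k}.
\end{align*}
% Maximizing \sum_k \log S_k subject to \sum_k c_{rk} S_k = R_r gives
% 1/S_k = \sum_r \beta_r c_{rk}. If Q satisfies the same constraints, then
\begin{align*}
\sum_{k}\frac{Q_k}{S_k}
  = \sum_{r}\beta_r\sum_{k}c_{rk}Q_k
  = \sum_{r}\beta_r R_r
  = \sum_{r}\beta_r\sum_{k}c_{rk}S_k
  = \sum_{k}\frac{S_k}{S_k} = N,
\end{align*}
% so the mean of Q_k/S_k is 1 and both measures reduce to the same value:
\begin{align*}
d_{IS}(Q,S) = d_{GO}(Q,S) = -\frac{1}{N}\sum_{k}\log\frac{Q_k}{S_k}.
\end{align*}
```

The argument uses only the fact that 1/S_k is a linear combination of the constraint functions, which is why it applies to the MESA estimate S but not to the MENSA estimate S*.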

Comparison  of  Synthetic  Speech 

Although results such as in Figs. 7 and 11 suggest that S is better than S*, the separation is hardly compelling. This is a case where the proof must be in the hearing. Consequently we synthesized the entire test sentence using standard LPC methods and using the 29th-order LPC approximations to S and S* for each of the 100 frames, as discussed at the end of the preceding main section. The 29th-order LPC approximations to the four spectra of the two selected frames are also plotted in Figs. 3 through 6. The two curves are indistinguishable in Figs. 3, 4, and 6; the only discrepancy is for the MENSA spectrum with deep nulls (Fig. 5). In that case the 29th-order approximation is unable to match the deep nulls and also exhibits some peak splitting.

The LPC speech and the speech based on S sounded identical, adding further confidence to the discrete frequency approximations. The two versions based on S and S* sounded different, but, somewhat to our surprise, we and others judged them to be equally intelligible. There was, however, a distinct qualitative difference when preemphasis was not used. The speech based on S* was qualitatively inferior: it had a distinct ringing quality, as though spoken from the other end of a long, wide pipe. When preemphasis was used, alone or with Hamming windowing, the ringing quality was greatly reduced or effectively eliminated. Hamming windowing alone reduced the ringing only slightly. We hypothesize that this ringing effect is a reflection of the characteristic lobe structure and deep minima of the spectral estimates S*, since the ringing is most prominent when the lobing is most prominent and regular. However, the ringing can be almost imperceptible while lobing is still plainly visible in spectral plots.

CONCLUSION 

Primarily on the basis of the results of speech synthesis, but also on results like Fig. 7, Fig. 11, and Table 1, we believe that MESA (S) yields better power spectrum estimates for speech processing than does MENSA (S*). This empirical conclusion also supports the theoretical discussion in the last paragraph of the background section.